Part-of-Speech Tagging for English-Spanish Code-Switched Text

Authors

  • Thamar Solorio
  • Yang Liu
Abstract

Code-switching is a linguistic phenomenon commonly observed in highly bilingual communities: the mixing of languages within the same conversational event. This paper presents results on Part-of-Speech tagging of Spanish-English code-switched discourse. We explore different approaches to exploiting existing resources for both languages, ranging from simple heuristics, to language identification, to machine learning. The best results are achieved by training a machine learning algorithm with features that combine the output of an English and a Spanish Part-of-Speech tagger.
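To make the idea of combining the two taggers' output more concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the feature names, the shared toy tag set, and the assumption that each tagger returns a (tag, confidence) pair per token are all hypothetical choices made for illustration. It shows one way per-token features for a learner might be assembled, next to a simple confidence-based heuristic of the kind the abstract alludes to.

```python
from typing import Dict, List, Tuple

# Hypothetical per-token tagger output: (tag, confidence in [0, 1]).
# Real English and Spanish taggers usually use different tag sets;
# a shared toy tag set is assumed here to keep the sketch short.
TaggerOutput = Tuple[str, float]


def combine_features(
    tokens: List[str],
    en_out: List[TaggerOutput],
    es_out: List[TaggerOutput],
) -> List[Dict[str, object]]:
    """Build one feature dict per token from both taggers' outputs."""
    if not (len(tokens) == len(en_out) == len(es_out)):
        raise ValueError("tokens and tagger outputs must be aligned")
    features = []
    for token, (en_tag, en_conf), (es_tag, es_conf) in zip(tokens, en_out, es_out):
        features.append({
            "word": token.lower(),
            "en_tag": en_tag,
            "en_conf": en_conf,
            "es_tag": es_tag,
            "es_conf": es_conf,
            "taggers_agree": en_tag == es_tag,
        })
    return features


def confidence_baseline(
    en_out: List[TaggerOutput],
    es_out: List[TaggerOutput],
) -> List[str]:
    """Heuristic baseline: keep the tag from the more confident tagger."""
    return [
        en_tag if en_conf >= es_conf else es_tag
        for (en_tag, en_conf), (es_tag, es_conf) in zip(en_out, es_out)
    ]


# Made-up outputs for a short code-switched utterance.
tokens = ["I", "want", "una", "casa"]
en_out = [("PRON", 0.99), ("VERB", 0.95), ("NOUN", 0.40), ("NOUN", 0.35)]
es_out = [("NOUN", 0.30), ("NOUN", 0.25), ("DET", 0.97), ("NOUN", 0.98)]

features = combine_features(tokens, en_out, es_out)
baseline_tags = confidence_baseline(en_out, es_out)
```

The feature dicts could then be passed to any standard classifier together with gold tags from annotated code-switched data; which learner and which features work best is an empirical question that this sketch does not answer.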

Similar resources

POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments

We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and present a new feature set that addresses the transliteration problem inherent in social media. We achieve 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...

Part of Speech Tagging for Code Switched Data

We address the problem of Part of Speech (POS) tagging in the context of linguistic code switching (CS). CS is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential CS, respectively. Processing CS data is especially challenging in intra-sentential data given state of the art monoli...

Crowdsourcing Universal Part-of-Speech Tags for Code-Switching

Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during communication. The importance of developing language technologies for code-switching data is immense, given the large populations that routinely code-switch. High-quality linguistic annotations are extremely valuable for any NLP task, and performance is often limited by the amount of high-qualit...

A Text Processing Tool for the Romanian Language

BALIE is a multilingual text processing tool designed to support information extraction. In this paper we explain how we adapted it to work for the Romanian language. With this addition, the tool supports five languages: English, French, German, Spanish, and Romanian. The services offered by the tool are: language identification, tokenization, sentence boundary detection, and part-of-speech tag...

SMPOST: Parts of Speech Tagger for Code-Mixed Indic Social Media Text

Use of social media has grown dramatically during the past few years. Users usually communicate informally on social media. The language of communication is often mixed in nature, with people transcribing their regional language alongside English. This style of writing is rapidly gaining popularity. Natural language processing (NLP) aims to infer the informat...

Publication date: 2008